Skills: Visualizing model results

You can use your model results to generate predictions for hypothetical cases, and this can be a useful way to visualize your model results.

Initial setup

I’ll start by loading my data and estimating my preferred model.

R

For the R examples on this page, I’ll start by loading the data and the packages I need, and by estimating my preferred model and saving it as a model object called my_model.

library(tidyverse)
library(here)

my_data <- here("datasets",
     "test.csv") |>
  read_csv() |>
  mutate(building = factor(building,
                           levels = c("Single-family",
                                      "Duplex",
                                      "Three-plus")))

my_model <- lm(log2(`sales-price`) ~ building + 
                                     rooms + 
                                     sch_quality + 
                                     log2(city_dist) + 
                                     sch_quality*building, 
               data = my_data) 

summary(my_model)
## 
## Call:
## lm(formula = log2(`sales-price`) ~ building + rooms + sch_quality + 
##     log2(city_dist) + sch_quality * building, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3998 -0.5162  0.0072  0.5125  2.7573 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    16.319563   0.066941 243.791  < 2e-16 ***
## buildingDuplex                  0.011077   0.068136   0.163  0.87086    
## buildingThree-plus              0.066305   0.067652   0.980  0.32706    
## rooms                          -0.001922   0.007547  -0.255  0.79897    
## sch_quality                     0.397331   0.015119  26.280  < 2e-16 ***
## log2(city_dist)                -0.207483   0.010751 -19.299  < 2e-16 ***
## buildingDuplex:sch_quality     -0.073603   0.024252  -3.035  0.00241 ** 
## buildingThree-plus:sch_quality -0.027152   0.024072  -1.128  0.25937    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.761 on 9992 degrees of freedom
## Multiple R-squared:  0.1577, Adjusted R-squared:  0.1571 
## F-statistic: 267.2 on 7 and 9992 DF,  p-value: < 2.2e-16

Notice that I’ve releveled my categorical variable so that “Single-family” comes first (so that it will be used as my reference category). My model includes a base-two log transformation for the outcome (home sale price) and for one of the predictors (distance to the city). It also includes an interaction between school quality and building type.
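As an aside, the factor() call in the setup code could equivalently be written with fct_relevel() from the forcats package (attached automatically with the tidyverse), which moves the named level to the front and leaves the remaining levels in their existing order. This is just a sketch of an alternative, not part of the original workflow; here the two approaches produce the same level order:

```r
library(forcats)  # attached automatically with the tidyverse

# Equivalent to the factor() call in the setup code: move
# "Single-family" to the front; "Duplex" and "Three-plus" keep
# their (alphabetical) order after it
my_data <- my_data |>
  mutate(building = fct_relevel(building, "Single-family"))
```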

Excel

Here are the same model results in Excel:

Generating a set of hypothetical values

To visualize the relationships in your model, it may be helpful to generate a data set where one continuous predictor varies across its full range, and the others are held at their averages.

Generating a sequence of values across a range

You might start by generating a sequence of numbers that covers the full range of your variable.

R

In my data set, I know that the city distance variable ranges from about a quarter mile to 20 miles, which I can confirm using the summary() function.

summary(my_data$city_dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2731  2.8701  3.9791  4.3751  5.4798 20.0000

So a set of values that covers that full range might start at 0.25 and increase in increments of 0.25 until it reaches a value of 20. I can use the seq() function to generate a set of numbers like this.

vary_dist <- seq(from = 0.25, 
                 to = 20, 
                 by = 0.25)

Now I have that set of 80 numbers in a variable called vary_dist.

vary_dist
##  [1]  0.25  0.50  0.75  1.00  1.25  1.50  1.75  2.00  2.25  2.50  2.75  3.00
## [13]  3.25  3.50  3.75  4.00  4.25  4.50  4.75  5.00  5.25  5.50  5.75  6.00
## [25]  6.25  6.50  6.75  7.00  7.25  7.50  7.75  8.00  8.25  8.50  8.75  9.00
## [37]  9.25  9.50  9.75 10.00 10.25 10.50 10.75 11.00 11.25 11.50 11.75 12.00
## [49] 12.25 12.50 12.75 13.00 13.25 13.50 13.75 14.00 14.25 14.50 14.75 15.00
## [61] 15.25 15.50 15.75 16.00 16.25 16.50 16.75 17.00 17.25 17.50 17.75 18.00
## [73] 18.25 18.50 18.75 19.00 19.25 19.50 19.75 20.00

Excel

In Excel, I can start a column with the first few numbers in the sequence, select them, and then drag the fill handle (the small square at the bottom-right corner of the selection) down to repeat the pattern until I reach the maximum value I want.
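If your version of Excel supports dynamic array formulas (Excel 365 and later), the built-in SEQUENCE() function can fill the whole column in one step. For example, this formula returns 80 values in one column, starting at 0.25 and increasing by 0.25:

```
=SEQUENCE(80, 1, 0.25, 0.25)
```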

Creating a dataset

Now you can use that set of varying values as part of a data set where everything else is held constant (perhaps varying the categorical variable as well).

R

In R, you can use the tibble() function to construct a data frame that has your varying values as one column and a constant value in each of the other columns.

vary_dist_single_fam <- tibble(city_dist = vary_dist,
                               rooms = mean(my_data$rooms),
                               sch_quality = mean(my_data$sch_quality),
                               building = "Single-family")

You might want to do this for each of the categories in your categorical variable.

vary_dist_duplex <- tibble(city_dist = vary_dist,
                           rooms = mean(my_data$rooms),
                           sch_quality = mean(my_data$sch_quality),
                           building = "Duplex")

vary_dist_triplex <- tibble(city_dist = vary_dist,
                            rooms = mean(my_data$rooms),
                            sch_quality = mean(my_data$sch_quality),
                            building = "Three-plus")

And then combine them all using the rbind() function.

vary_dist_all <- rbind(vary_dist_single_fam,
                       vary_dist_duplex,
                       vary_dist_triplex)
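As a quick sanity check (a small addition, not part of the original workflow), each tibble has 80 rows (one per distance value), and there are three building types, so the combined data set should have 240 rows:

```r
# 80 distance values x 3 building types = 240 rows
nrow(vary_dist_all)
## [1] 240
```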

Excel

Doing something similar in Excel is fairly straightforward and just involves some copying and pasting.

Predict the outcome

Now you want to apply your model to predict the outcome for each of your hypothetical cases.

R

In R, the predict() function is useful for this. You can specify that you want an estimate as well as a 95-percent confidence interval.

vary_dist_predict <- predict(my_model,
                             newdata = vary_dist_all,
                             level = 0.95,
                             interval = "confidence") |>
  as_tibble()

Here are the first few rows of the result:

head(vary_dist_predict)
## # A tibble: 6 × 3
##     fit   lwr   upr
##   <dbl> <dbl> <dbl>
## 1  17.8  17.7  17.9
## 2  17.6  17.5  17.7
## 3  17.5  17.4  17.5
## 4  17.4  17.3  17.4
## 5  17.3  17.3  17.4
## 6  17.3  17.2  17.3

The “fit” column is the predicted value for the outcome. “lwr” and “upr” are the lower and upper values for the 95-percent confidence interval, respectively.

You will want to use cbind() to attach these predictions to the dataset you used to generate them.

vary_dist_predict <- cbind(vary_dist_predict, vary_dist_all)

Now you have the predictors and the predicted outcomes in one place.

head(vary_dist_predict)
##        fit      lwr      upr city_dist  rooms sch_quality      building
## 1 17.79712 17.71065 17.88359      0.25 6.2382      2.7045 Single-family
## 2 17.58964 17.52329 17.65599      0.50 6.2382      2.7045 Single-family
## 3 17.46827 17.41336 17.52318      0.75 6.2382      2.7045 Single-family
## 4 17.38216 17.33509 17.42923      1.00 6.2382      2.7045 Single-family
## 5 17.31536 17.27412 17.35660      1.25 6.2382      2.7045 Single-family
## 6 17.26079 17.22407 17.29751      1.50 6.2382      2.7045 Single-family
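To see what predict() is doing under the hood, you can reproduce the first fitted value by hand from the coefficient table above. For the single-family reference category, the building dummies and interaction terms are all zero, so only four coefficients are involved (the values 6.2382 and 2.7045 are the means shown in the rows printed above):

```r
b <- coef(my_model)

# Intercept, plus each remaining predictor times its coefficient;
# city_dist enters the model on the base-two log scale, and
# log2(0.25) = -2
unname(b["(Intercept)"] +
         b["rooms"] * 6.2382 +
         b["sch_quality"] * 2.7045 +
         b["log2(city_dist)"] * log2(0.25))
# about 17.797, matching the first "fit" value shown above
```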

In this example, the first few predicted values are all around 17 or so. This seems low for a predicted home sale price, right? That’s because I log-transformed the outcome in my model, so the fitted values are the log of the home sale price. To get predicted home sale prices, I would want to undo that log. Since I used a base-two log, I undo it by taking two raised to the power of the predicted value.

vary_dist_predict <- 2^predict(my_model,
                               newdata = vary_dist_all,
                               level = 0.95,
                               interval = "confidence") |>
  as_tibble() |>
  cbind(vary_dist_all)

head(vary_dist_predict)
##        fit      lwr      upr city_dist  rooms sch_quality      building
## 1 227754.8 214505.4 241822.6      0.25 6.2382      2.7045 Single-family
## 2 197246.3 188380.7 206529.2      0.50 6.2382      2.7045 Single-family
## 3 181331.4 174559.4 188366.1      0.75 6.2382      2.7045 Single-family
## 4 170824.5 165341.1 176489.8      1.00 6.2382      2.7045 Single-family
## 5 163095.9 158499.8 167825.3      1.25 6.2382      2.7045 Single-family
## 6 157041.5 153094.9 161089.8      1.50 6.2382      2.7045 Single-family

Now my fitted values are in units of dollars.

Excel

In Excel, you generate a prediction using a formula that refers to your hypothetical observed values and your model coefficients.

I’m not aware of a simple way to generate confidence intervals for predicted values in Excel, so don’t worry about those if you’re working in Excel.
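As a rough sketch (all cell references here are hypothetical), suppose the intercept and the rooms, sch_quality, and log2(city_dist) coefficients were in cells B2, B5, B6, and B7 of your results sheet, and a single-family row’s rooms, sch_quality, and city_dist values were in cells C2, D2, and E2. The prediction formula, including undoing the base-two log on the outcome, might look something like:

```
=2^($B$2 + $B$5*C2 + $B$6*D2 + $B$7*LOG(E2, 2))
```

For duplex and three-plus rows, you would also add the relevant building dummy and interaction coefficients to the sum inside the parentheses.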

Visualizing predictions

These predictions can make it a little easier to explain your model results, especially when you have included interactions and/or non-linear relationships.

R

In R, you can use the geom_line() function to generate a line graph and, if you want, add geom_ribbon() to illustrate the confidence interval.

ggplot(vary_dist_predict) +
  geom_line(aes(x = city_dist,
                y = fit,
                color = building)) +
  geom_ribbon(aes(x = city_dist,
                  ymin = lwr,
                  ymax = upr,
                  fill = building),
              alpha = 0.3)

Excel

In Excel, if you want to show separate predictions for each category, you will need to rearrange your data a little bit.

And then you can create a chart from that rearranged data.